CHAPTER 13 Taking a Closer Look at Fourfold Tables 189

When considering the consistency of a binary rating (like yes or no) for the same item between two raters, you can estimate inter-rater reliability by having each rater rate the same group of items. Imagine we had two raters rate the same 50 scans as yes or no according to whether each scan showed a tumor. We cross-tabbed the results and present them in Figure 13-6.

Looking at Figure 13-6, cell a contains a count of how many scans were rated yes (there is a tumor) by both Rater 1 and Rater 2. Cell b counts how many scans were rated yes by Rater 1 but no by Rater 2. Cell c counts how many scans were rated no by Rater 1 and yes by Rater 2, and cell d shows where Rater 1 and Rater 2 agreed and both rated the scan no. Cells a and d are considered concordant because both raters agreed, and cells b and c are discordant because the raters disagreed.

Ideally, all the scans would be counted in concordant cells a or d of Figure 13-6, and discordant cells b and c would contain zeros. A measure of how close the data come to this ideal is called Cohen's Kappa, and is signified by the Greek lowercase kappa: κ. You calculate kappa as:

κ = 2(ad − bc) / (r₁c₂ + r₂c₁)

where r₁ and r₂ are the row totals of the fourfold table and c₁ and c₂ are the column totals.

For the data in Figure 13-6, κ = 2(22 × 16 − 5 × 7) / (27 × 21 + 23 × 29) = 634/1,234, which is 0.5138. How is this interpreted?
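The arithmetic above is easy to check with a few lines of code. Here is a minimal sketch in Python (the function name and argument names are mine, not from the chapter) that computes kappa directly from the four cell counts of a fourfold table:

```python
def fourfold_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 (fourfold) table of two raters' yes/no calls.

    a = both raters said yes, d = both said no (the concordant cells);
    b and c are the discordant cells where the raters disagreed.
    """
    r1, r2 = a + b, c + d  # row totals (Rater 1's yes and no counts)
    c1, c2 = a + c, b + d  # column totals (Rater 2's yes and no counts)
    return 2 * (a * d - b * c) / (r1 * c2 + r2 * c1)

# The counts from Figure 13-6: a = 22, b = 5, c = 7, d = 16
print(round(fourfold_kappa(22, 5, 7, 16), 4))  # prints 0.5138
```

Note that perfect agreement (empty discordant cells) gives exactly 1: for example, fourfold_kappa(25, 0, 0, 25) returns 1.0.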

If the raters are in perfect agreement, then κ = 1. If the ratings are completely random, κ will come out close to 0. You may think this means κ always takes on a value between 0 and 1, but random sampling fluctuations can actually cause κ to be negative. This situation can be compared to a student taking a true/false test where the number of wrong answers is subtracted from the number of right answers as a penalty for guessing. When calculating κ, getting a score less than zero indicates the interesting combination of being both incorrect and unfortunate, and is penalized!

FIGURE 13-6: Results of two raters reading the same set of 50 specimens and rating each specimen yes or no. © John Wiley & Sons, Inc.